The high-throughput poly(A) RNA-seq data used in this notebook are described in Neeves et al, Brain (2022). They are derived from nuclear and cytoplasmic fractions of human induced pluripotent stem cells (hiPSC; day 0), neural precursors (NPC; day 3 and day 7), ‘patterned’ precursor motor neurons (ventral spinal cord; pMN; day 14), post-mitotic but electrophysiologically immature motor neurons (MN; day 22), and electrophysiologically active MNs (mMNs; day 35). The data required for this practical session can be downloaded from Zenodo.
The gene expression count data was obtained from Kallisto (Bray et
al., 2016) using the Gencode hg38 release Homo sapiens
transcriptome.
# Loading the data
load("./data_09_04_2024.RData")
# Composition of the data:
# myE_ge : raw gene expression count matrix
# info : sample annotation (data-frame)
# ttg : rows (genes) annotation
# Focus on CTRL samples for this session
sel_samples <- which(info$mutant=="CTRL")
myE_ge <- myE_ge[,sel_samples]
info <- info[sel_samples,]
info$group <- factor(paste(info$Fraction,info$DIV,sep="_"),levels=unique(paste(info$Fraction,info$DIV,sep="_")))
# Make some nice colors to facilitate the visualisation of time-points
mytime <- factor(as.character(info$DIV),levels=c(0,3,7,14,22,35))
mycols_days <- c("#CCFF00","#33CC33","#669999","#6699FF","#3300FF","#CC33CC")
names(mycols_days) <- c(0,3,7,14,22,35)
mycols <- unlist(lapply(info$DIV,function(Z)return(mycols_days[match(as.character(Z),names(mycols_days))])))
How many samples is there in your count data?
Does this correspond to your experimental protocol?
Check that the colnames of the data-count corresponds to your sampleID in the sample annotation file.
What are the covariates?
how many rows?
Check that the rowID of the data-count corresponds to the ensembl gene ID in your gene annotation file.
#View(info)
#nrow(info)
#Check that the colnames of the data-count corresponds to your sampleID in the sample annotation file:
match(colnames(myE_ge),info$sampleID)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
#Check that the colnames of the data-count corresponds to your sampleID in the sample annotation file:
sum(is.na(match(rownames(myE_ge),ttg$ens_gene)))
## [1] 0
How does your data look-like?
par(mfrow=c(2,3))
plot(density(myE_ge[,1]),main="Read count distribution in sample 1")
boxplot(myE_ge,las=1,ylab="raw gene count",main="with outliers")
boxplot(myE_ge,outline=FALSE,las=1,ylab="raw gene count",main="without outliers")
# t() computes the transpose
plot(density(t(myE_ge[1,])),main="Read count distribution across samples")
boxplot(t(myE_ge[c(1,10,45,100,200,3000,4000),]),las=1,ylab="raw gene count",main="with outliers")
boxplot(t(myE_ge[c(1,10,45,100,200,3000,4000),]),outline=FALSE,las=1,ylab="raw gene count",main="without outliers")
Analysis of the variance of the gene count across the \(N\) samples: \(\sigma^2=Var(X)=\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2\) and \(\mu=\frac{1}{n}\sum_{i=1}^n\)x_i:
Let’s have a close look at how the variance in gene expression scale with average and then correct it with some transformation.
#calculate mean and variance of the rows
row_avg <- apply(myE_ge,1,mean)
row_var <- apply(myE_ge,1,var)
#Log-transform your count data
myE_gel <- log2(1+myE_ge)
#DESeq2 variance stabilisation
vsd <- DESeq2::varianceStabilizingTransformation(matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE))
par(mfrow=c(1,3))
#Variance scale with average --> Poisson distribution
plot(row_avg,row_var,las=1,main="RAW data",pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="mean read count",ylab="variance read count",xlim=c(0,5000),ylim=c(0,10^7))
grid()
plot(apply(myE_gel,1,mean),apply(myE_gel,1,var),las=1,main="Log2-transformed data",pch=19,col=rgb(0,0,0,0.1),cex=0.5,xlab="mean read count",ylab="variance read count")
grid()
plot(apply(vsd,1,mean),apply(vsd,1,var),las=1,main="DESeq2 variance stabilised",pch=19,col=rgb(0,0,0,0.05),cex=0.5,xlab="mean read count",ylab="variance read count")
grid()
Let’s first have a look at the distribution of gene expression for a few samples:
par(mfrow=c(1,3))
geneplotter::multidensity(myE_gel[,c(1,2,4,10)],main="Read count distributions",las=1,xlab="read count [log2]")
plot(density(myE_gel[,1]),las=1,main=colnames(myE_gel)[1],las=1,xlab="read count [log2]")
plot(density(myE_gel[,3]),las=1,main=colnames(myE_gel)[3],las=1,xlab="read count [log2]")
We can identify reliably expressed genes by fitting a bimodal distribution to the log2-read count distribtution of each samples to discriminate between the foreground and the background transcription. The limit must be fitted to each individual sample given that the library size will impact this factor.
We can now select for the reliably expressed genes in each sample:
Let’s first look at the ditribution of the gene count across a few
samples:
There are several options for normalisation. Scaling factors,
quantile normalisation etc.
What is the effect on the gene count?
Hierarchical clustering of the samples is frequently used to analyse whether similar samples cluster together.
Let’s first compare the clustering of the samples using the Manhattan
distance and Ward D algorithm:
Samples are very similarly clustered in Quantile and Deseq2 normalised methods. Let’s compare the different effect of the distances used as well as agglomerative alorithm. From now we will use the quantile normalised matrix:
Your homework will be graded according to three criteria: 1) Correctness of your result; 2) Clarity of the visual output; 3) Description of your results demonstrating your ability to discuss your result in their biological context.
In this task, you will perform a differential gene expression using Deseq2.
Answer: The aim is to conduct a differential gene expression analysis using Deseq2.This tool enables the identification of genes showing differential expression across various experimental conditions or groups. Our objective is to compare each time point to the with the baseline time, DIV 0, determining the count of genes exhibiting significant upregulation and downregulation.
Every passage and my decisions are explained in the comments.
# Store the data, colnames, rownames and info
mycountData <- matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE)
colnames(mycountData) <- colnames(myE_ge)
rownames(mycountData) <- rownames(myE_ge)
# In the levels, it is important to put "0" as first, so the analysis uses it as base line to confront the others
info$DIV <- factor(info$DIV,levels=c("0","3","7","14","22","35"))
# Create the levels also for fraction
info$Fraction <- factor(info$Fraction, levels=c("Nuclear", "Cytoplasmic"))
# Design formula is used to estimate dispersion and log2 fold changes
# In info, it is possible to see that there are only two variables giving the variability: DIV and Fraction.
# group is just the "union" of the two.
# Since Fraction is giving variability in the data, but we want only to see the variability due to the timepoints, I use Fraction + DIV and then select just the results relative to DIV
# In this case, DIV is the factor of interest and Fraction is another source of variation in the data, so
# DIV is put as the second variable
design <- as.formula(~ Fraction + DIV)
model.matrix(design, data=info)
## (Intercept) FractionCytoplasmic DIV3 DIV7 DIV14 DIV22 DIV35
## 5 1 0 0 0 0 0 0
## 17 1 0 0 0 0 0 0
## 29 1 0 0 0 0 0 0
## 41 1 0 0 0 0 0 0
## 7 1 0 1 0 0 0 0
## 19 1 0 1 0 0 0 0
## 31 1 0 1 0 0 0 0
## 43 1 0 1 0 0 0 0
## 9 1 0 0 1 0 0 0
## 21 1 0 0 1 0 0 0
## 33 1 0 0 1 0 0 0
## 45 1 0 0 1 0 0 0
## 11 1 0 0 0 1 0 0
## 23 1 0 0 0 1 0 0
## 35 1 0 0 0 1 0 0
## 47 1 0 0 0 1 0 0
## 1 1 0 0 0 0 1 0
## 13 1 0 0 0 0 1 0
## 25 1 0 0 0 0 1 0
## 37 1 0 0 0 0 1 0
## 3 1 0 0 0 0 0 1
## 15 1 0 0 0 0 0 1
## 27 1 0 0 0 0 0 1
## 39 1 0 0 0 0 0 1
## 4 1 1 0 0 0 0 0
## 16 1 1 0 0 0 0 0
## 28 1 1 0 0 0 0 0
## 40 1 1 0 0 0 0 0
## 6 1 1 1 0 0 0 0
## 18 1 1 1 0 0 0 0
## 30 1 1 1 0 0 0 0
## 42 1 1 1 0 0 0 0
## 8 1 1 0 1 0 0 0
## 20 1 1 0 1 0 0 0
## 32 1 1 0 1 0 0 0
## 44 1 1 0 1 0 0 0
## 10 1 1 0 0 1 0 0
## 22 1 1 0 0 1 0 0
## 34 1 1 0 0 1 0 0
## 46 1 1 0 0 1 0 0
## 12 1 1 0 0 0 1 0
## 24 1 1 0 0 0 1 0
## 36 1 1 0 0 0 1 0
## 48 1 1 0 0 0 1 0
## 2 1 1 0 0 0 0 1
## 14 1 1 0 0 0 0 1
## 26 1 1 0 0 0 0 1
## 38 1 1 0 0 0 0 1
## attr(,"assign")
## [1] 0 1 2 2 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$Fraction
## [1] "contr.treatment"
##
## attr(,"contrasts")$DIV
## [1] "contr.treatment"
# Create an object of class DESeqDataSet
# We select only the genes that are expressed global
dds <- DESeq2::DESeqDataSetFromMatrix(countData = mycountData[is_expressed_global,],
colData = info,
design= design)
# Run the function DEseq2(), which will compute the following:
# 1. Estimating size factors
# 2. Estimating dispersions
# 3. Gene-wise dispersion estimates
# 4. Mean-dispersion relationship
# 5. Final dispersion estimates
# 6. Fitting model and testing
dds <- DESeq2::DESeq(dds)
# Lists the coefficients
mycoeff <- resultsNames(dds)
# Extract the coefficients related to DIV
mycoeff_DIV <- mycoeff[startsWith(mycoeff, "DIV")]
# Function to extract the results
results_computation <- function(name) {
res <- results(dds, name = name)
return (res)
}
# Extract the results for the comparison between time points
DIV_results <- sapply(mycoeff_DIV, results_computation)
# Function to find the number of upregulated genes
genes_upregulated_computation <- function(res) {
# To select the upregulated genes, we consider a log2FC higher than 1, and a adjusted p-value lower than 0.01
# The adjusted p-value is obtained with Benjamini and Hochberg method
genes_up <- as.character(rownames(res))[res$log2FoldChange>1.0&res$padj<0.01]
return (genes_up)
}
# Function to find the number of downregulated genes
genes_downregulated_computation <- function(res) {
# To select the downregulated genes, we consider a log2FC lower than -1, and a adjusted p-value lower than 0.01
# The adjusted p-value is obtained with Benjamini and Hochberg method
genes_do <- as.character(rownames(res))[res$log2FoldChange<(-1.0)&res$padj<0.01]
return (genes_do)
}
# Find the upregulated and downregulated genes
genes_upregulated <- sapply(DIV_results, genes_upregulated_computation)
genes_downregulated <- sapply(DIV_results, genes_downregulated_computation)
# Find the number of upregulated and downregulated genes
genes_upregulated_number <- sapply(genes_upregulated, length)
genes_downregulated_number <- sapply(genes_downregulated, length)
# Create a layout with 1 row and 2 columns
par(mfrow=c(1, 2))
# Bar plot of upregulated genes
up_bar <- barplot(genes_upregulated_number,
names.arg=mycoeff_DIV,
xlab="Comparison between time points",
ylab="Number of genes",
main="Number of upregulated genes",
col=mycols_days[2:length(mycols_days)],
cex.names=0.7,
ylim=c(0, 4800))
text(x = up_bar, y = genes_upregulated_number + 0.5, label = genes_upregulated_number, pos = 3)
# Bar plot of downregulated genes
down_bar <- barplot(genes_downregulated_number,
names.arg=mycoeff_DIV,
xlab="Comparison between time points",
ylab="Number of genes",
main="Number of downregulated genes",
col=mycols_days[2:length(mycols_days)],
cex.names=0.7,
ylim=c(0, 4800))
text(x = down_bar, y = genes_downregulated_number + 0.5, label = genes_downregulated_number, pos = 3)
Comment: The number of differential expressed genes increases
progressively with time, with the biggest step between DIV_14_vs_0 and
DIV_22_vs_0. Since we are comparing the baseline (DIV 0) pluripotent
stem cells (iPSCs), with cells that are gradually becoming more
specialized, the growing number of genes with different expression
levels over time is expected. As stem cells differentiate into specific
cell types, the expression of genes changes to carry out their new
specific functions.
#Volcano plot
VolcanoPlot <- function(myDGE=res, col_log2FC=res$log2FoldChange, col_pval=res$padj, title){
mycol_genes <- rep(rgb(0.7,0.7,0.7,0.2))
mycol_genes[abs(col_log2FC)>=1.0&col_pval<=0.05] <- rep(rgb(0.0,0.0,0.0,0.2))
plot(col_log2FC,-log10(col_pval),pch=19,cex=0.3,col=mycol_genes,xlab="log2FC",ylab="-log10(P-value)", main=title, frame=FALSE, ylim=c(0, 270))
abline(h=-log10(0.01),lty=2,col="grey")
abline(v=c(-1,1),lty=2,col="grey")
}
#Volcano plots
par(mfrow=c(1,5))
invisible(sapply(mycoeff_DIV, function(x){VolcanoPlot(myDGE=DIV_results[[x]], col_log2FC=DIV_results[[x]]$log2FoldChange, col_pval=DIV_results[[x]]$padj, title=x)}))
Comment: This plot shows the number of differentially expressed genes
(corresponding to the dots). The upregulated genes have a positive
(greater than 1) log2FC, while the dowregulated genes have a negative
(smaller than -1) log2FC. This plot also shows how genes are distributed
based on two measures: adjusted p-value (calculated using the Benjamini
and Hochberg method) and the log2 fold change. We can see that as time
progresses (compared to the baseline DIV 0), two things happen: the
total number of differentially expressed genes increases (the dots), and
there’s also a rise in the number of genes with a very low adjusted
p-value (high -log10 p-value).
# Biological Pathway Gene Enrichment Analysis
GO_analysis <- function(genes_list){
gostres_diff <- gost(query = genes_list,
organism = "hsapiens", ordered_query = FALSE,
multi_query = FALSE, significant = TRUE, exclude_iea = FALSE,
measure_underrepresentation = FALSE, evcodes = TRUE,
user_threshold = 0.05, correction_method = "g_SCS",
domain_scope = "annotated", custom_bg = NULL,
numeric_ns = "", as_short_link = FALSE, sources=c("GO:BP", "GO:MF","KEGG"))
gostplot(gostres_diff, capped = FALSE, interactive = TRUE) # please note this is going to create an interactive plot
}
# Find the biological pathways associated with the upregulated genes in DIV 3 compared to DIV 0
genes_upregulated_DIV3 = genes_upregulated[[1]]
GO_analysis(genes_upregulated_DIV3)
# Find the biological pathways associated with the downregulated genes in DIV 3 compared to DIV 0
genes_downregulated_DIV3 = genes_downregulated[[1]]
GO_analysis(genes_downregulated_DIV3)
Comment: Considering GO:BP (Gene Ontology - Biological Process), the upregulated genes seem to be linked to developmental processes (system development, anatomical structure development, developmental processes, nervous system development), while the downregulated genes seem to be linked to signaling processes (cell communication, signaling, signal transduction). Considering KEGG, we find the Wnt signaling pathway in the upregulated, and neuroactive ligand-receptor interaction in the dowregulated, as the ones with more statistical significance. The upregulation of developmental processes drives cellular differentiation (from hiPSCs at DIV 0 to NPCs at DIV 3). NPCs dowregulated general communication and signaling pathways, but upregulate more specific signaling pathways, in this case wingless (WNT) signaling pathway. By analyzing the enriched pathways, we can gain a deeper understanding of genes and processes involved in cellular differentiation of hiPSCs (human induced pluripotent stem cells, DIV 0) cinto NPCs (neural precursor cells, DIV 3).
Do the same as task 1 but using EdgeR – Likelihood ratio tests. Comment on the differences between the techniques.
Answer: As before, please see the comments to understand the passages and my choices. Two different tests are techniques: edgeR with likelihood ratio tests and edgeR with quasi-likelihood F-tests.
# Store the data, colnames, rownames and info
mycountData <- matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE)
colnames(mycountData) <- colnames(myE_ge)
rownames(mycountData) <- rownames(myE_ge)
# In the levels, it is important to put "0" as first, so the analysis uses it as base line to confront the others
info$DIV <- factor(info$DIV,levels=c("0","3","7","14","22","35"))
info$Fraction <- factor(info$Fraction, levels=c("Nuclear", "Cytoplasmic"))
# I used the same design as before, in order to be able to compare the different methods
design <- as.formula(~ Fraction + DIV)
# We select only the genes that are expressed global
y <- DGEList(counts=mycountData[is_expressed_global,], group=info$group)
design <- model.matrix(design, data=info)
y <- estimateDisp(y,design)
# To perform likelihood ratio tests:
# Fit the model
fit <- glmFit(y,design=design)
# Get the coefficients
mycoefs <- colnames(fit$design)
# list of coefficients we are interested in
coefficients <- c(3, 4, 5, 6, 7)
# Function to find the number of upregulated genes
genes_upregulated_computation_edgeR <- function(coeff) {
lrt.nuc <- glmLRT(fit, coef=coeff)
res <- topTags(lrt.nuc, n=sum(is_expressed_global))$table
genes_up <- as.character(rownames(res))[res$logFC>1.0&res$FDR<0.01]
return (genes_up)
}
# Function to find the number of downregulated genes
genes_downregulated_computation_edgeR <- function(coeff) {
lrt.nuc <- glmLRT(fit, coef=coeff)
res <- topTags(lrt.nuc, n=sum(is_expressed_global))$table
genes_do <- as.character(rownames(res))[res$logFC<(-1.0)&res$FDR<0.01]
return (genes_do)
}
# Find the upregulated and downregulated genes
genes_upregulated_edgeR <- sapply(coefficients, genes_upregulated_computation_edgeR)
genes_downregulated_edgeR <- sapply(coefficients, genes_downregulated_computation_edgeR)
# Find the number of upregulated and downregulated genes
genes_upregulated_number_edgeR <- sapply(genes_upregulated_edgeR, length)
genes_downregulated_number_edgeR <- sapply(genes_downregulated_edgeR, length)
# Create a layout with 1 row and 2 columns
par(mfrow=c(1, 2))
# Bar plot of upregulated genes
up_bar <- barplot(genes_upregulated_number_edgeR,
names.arg=mycoeff_DIV,
xlab="Comparison between time points",
ylab="Number of genes",
main="Number of upregulated genes",
col=mycols_days[2:length(mycols_days)],
cex.names=0.7,
ylim=c(0, 5000))
text(x = up_bar, y = genes_upregulated_number_edgeR + 0.5, label = genes_upregulated_number_edgeR, pos = 3)
# Bar plot of downregulated genes
down_bar <- barplot(genes_downregulated_number_edgeR,
names.arg=mycoeff_DIV,
xlab="Comparison between time points",
ylab="Number of genes",
main="Number of downregulated genes",
col=mycols_days[2:length(mycols_days)],
cex.names=0.7,
ylim=c(0, 5000))
text(x = down_bar, y = genes_downregulated_number_edgeR + 0.5, label = genes_downregulated_number_edgeR, pos = 3)
Comment: With edgeR, we find similar results to the previous method (DESeq2). Even in this case it is possible to see that the biggest step is between DIV 14 vs 0 and DIV 22 vs 0. With DESeq2, we find more upregulated genes compared to edgeR, while with edgeR we find more dowregulated genes compared to DESeq2.
#Volcano plots
par(mfrow=c(1,5))
invisible(sapply(coefficients, function(coeff) {
# Perform glmLRT and topTags computations
lrt.nuc <- glmLRT(fit, coef=coeff)
res <- topTags(lrt.nuc, n=sum(is_expressed_global))$table
# Generate VolcanoPlot
VolcanoPlot(myDGE=res, col_log2FC = res$logFC, col_pval = res$FDR, title = mycoefs[coeff])
}))
Comment: As timepoint (compared to the initial one, zero) increases, the number of differential expressed genes increases and also the number of genes with a very low adjusted p-value increases, just as in the previous method. However, in DESeq2 the -log of adjusted p-values were slightly higher than with edgeR, meaning that edgeR is more conservative than DESeq2.
# Find the biological pathways associated with the upregulated genes in DIV 3 compared to DIV 0
genes_upregulated_DIV3_edgeR = genes_upregulated_edgeR[[1]]
GO_analysis(genes_upregulated_DIV3_edgeR)
# Find the biological pathways associated with the upregulated genes in DIV 3 compared to DIV 0
genes_downregulated_DIV3_edgeR = genes_downregulated_edgeR[[1]]
GO_analysis(genes_downregulated_DIV3_edgeR)
Comment: The results found using edgeR are very similar to the ones using DESeq2 for the GO enrichment: the uperegulated genes are related to developmental processes, while the downregulated genes are linked to signaling processes.
# Store the data, colnames, rownames and info
mycountData <- matrix(as.integer(myE_ge),nrow=nrow(myE_ge),ncol=ncol(myE_ge),byrow = FALSE)
colnames(mycountData) <- colnames(myE_ge)
rownames(mycountData) <- rownames(myE_ge)
# In the levels, it is important to put "0" as first, so the analysis uses it as base line to confront the others
info$DIV <- factor(info$DIV,levels=c("0","3","7","14","22","35"))
info$Fraction <- factor(info$Fraction, levels=c("Nuclear", "Cytoplasmic"))
# We select only the genes that are expressed global
y <- DGEList(counts=mycountData[is_expressed_global,],group=info$group)
# I used the same design as before, in order to be able to compare the different methods
design <- as.formula(~ info$Fraction + info$DIV)
design <- model.matrix(design)
y <- estimateDisp(y,design)
# To perform quasi-likelihood F-tests:
# Fit the model
fit <- glmQLFit(y,design)
# Get the coefficients
mycoefs <- colnames(fit$design)
# list of coefficients we are interested in
coefficients <- c(3, 4, 5, 6, 7)
# Function to find the number of upregulated genes
genes_upregulated_computation_edgeR_QL <- function(coeff) {
qlf.nuc <- glmQLFTest(fit, coef=coeff)
res<- topTags(qlf.nuc,n=sum(is_expressed_global))$table
genes_up <- as.character(rownames(res))[res$logFC>1.0&res$FDR<0.01]
return (genes_up)
}
# Function to find the number of downregulated genes
genes_downregulated_computation_edgeR_QL <- function(coeff) {
qlf.nuc <- glmQLFTest(fit, coef=coeff)
res<- topTags(qlf.nuc,n=sum(is_expressed_global))$table
genes_do <- as.character(rownames(res))[res$logFC<(-1.0)&res$FDR<0.01]
return (genes_do)
}
# Find the upregulated and downregulated genes
genes_upregulated_edgeR_QL <- sapply(coefficients, genes_upregulated_computation_edgeR_QL)
genes_downregulated_edgeR_QL <- sapply(coefficients, genes_downregulated_computation_edgeR_QL)
# Find the number of upregulated and downregulated genes
genes_upregulated_number_edgeR_QL <- sapply(genes_upregulated_edgeR_QL, length)
genes_downregulated_number_edgeR_QL <- sapply(genes_downregulated_edgeR_QL, length)
# Create a layout with 1 row and 2 columns
par(mfrow=c(1, 2))
# Bar plot of upregulated genes
up_bar <- barplot(genes_upregulated_number_edgeR_QL,
names.arg=mycoeff_DIV,
xlab="Comparison between time points",
ylab="Number of genes",
main="Number of upregulated genes",
col=mycols_days[2:length(mycols_days)],
cex.names=0.7,
ylim=c(0, 5000))
text(x = up_bar, y = genes_upregulated_number_edgeR_QL + 0.5, label = genes_upregulated_number_edgeR_QL, pos = 3)
# Bar plot of downregulated genes
down_bar <- barplot(genes_downregulated_number_edgeR_QL,
names.arg=mycoeff_DIV,
xlab="Comparison between time points",
ylab="Number of genes",
main="Number of downregulated genes",
col=mycols_days[2:length(mycols_days)],
cex.names=0.7,
ylim=c(0, 5000))
text(x = down_bar, y = genes_downregulated_number_edgeR_QL + 0.5, label = genes_downregulated_number_edgeR_QL, pos = 3)
Comment: Here the results are slightly less than edgeR with likelihood tests, because quasi-likelihood tests are more conservatives. We find the same trends as before.
#Volcano plots
par(mfrow=c(1,5))
invisible(sapply(coefficients, function(coeff) {
# Perform glmQLFTest and topTags computations
qlf.nuc <- glmQLFTest(fit, coef=coeff)
res <- topTags(qlf.nuc, n=sum(is_expressed_global))$table
# Generate VolcanoPlot
VolcanoPlot(myDGE=res, col_log2FC = res$logFC, col_pval = res$FDR, title = mycoefs[coeff])
}))
Comment: The values of -10 log of adjusted p-values are quite smaller
compared to the previous two methods. In fact, edgeR calculates p-values
via the quasi-likelihood method, while DESeq uses Wald tests. For this
reason, edgeR with the quasi-likelihood method is more conservative than
the previous methods.
# Find the biological pathways associated with the upregulated genes in DIV 3 compared to DIV 0
genes_upregulated_DIV3_edgeR_QL = genes_upregulated_edgeR_QL[[1]]
GO_analysis(genes_upregulated_DIV3_edgeR_QL)
# Find the biological pathways associated with the upregulated genes in DIV 3 compared to DIV 0
genes_downregulated_DIV3_edgeR_QL = genes_downregulated_edgeR_QL[[1]]
GO_analysis(genes_downregulated_DIV3_edgeR_QL)
Comment: Also in this case, the pathways found looking at the upregulated genes are connected to the development, while the ones found looking at the downregulated genes are connected to cell communication and signaling pathways.